Unsupervised morphological parsing of Bengali
Authors
Sajib Dasgupta and Vincent Ng ({sajib,vince}@hlt.utdallas.edu), Human Language Technology Research Institute, University of Texas at Dallas, Richardson, TX 75083, USA
Abstract
Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and stems without prior knowledge of language-specific morphotactics and morphophonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of 4110 human-segmented Bengali words, our algorithm achieves an F-score of 83%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological parsers, by about 23%.
Similar resources
High-Performance, Language-Independent Morphological Segmentation
This paper introduces an unsupervised morphological segmentation algorithm that shows robust performance for four languages with different levels of morphological complexity. In particular, our algorithm outperforms Goldsmith’s Linguistica and Creutz and Lagus’s Morfessor for English and Bengali, and achieves performance that is comparable to the best results for all three PASCAL evaluation da...
Unity in Diversity: A Unified Parsing Strategy for Major Indian Languages
This paper presents our work on applying a non-linear neural network to parse five resource-poor Indian languages belonging to two major language families, Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While little work has been done previously on Bengali and Telugu linear transition-based parsing, w...
A Hybrid Model for Part-of-Speech Tagging and its Application to Bengali
This paper describes our work on Bengali Part of Speech (POS) tagging using a corpus-based approach. There are several approaches for part of speech tagging. This paper deals with a model that uses a combination of supervised and unsupervised learning using a Hidden Markov Model (HMM). We make use of small tagged corpus and a large untagged corpus. We also make use of Morphological Analyzer. ...
Example Based English-Bengali Machine Translation Using WordNet
In this paper we propose an architecture of English-Bengali Example Based Machine Translation (EBMT) using WordNet. The proposed EBMT system has five steps: 1) Tagging 2) Parsing 3) Prepare the chunks of the sentence using sub-sentential EBMT 4) Using an efficient adapting scheme, match the sentence rule 5) Translate from Source Language (English) to Target Language (Bengali) in the chunk and ge...
Unsupervised Morphological Segmentation with Recursive Neural Network
Motivated by (Socher et al., 2010; 2011)’s work in syntactic parsing of natural language sentences, where the input is a sequence of words, our goal is to learn similar hierarchical parse trees but for words instead, treating each character as a unit. By recursively grouping characters together, we aim to achieve unsupervised learning of not only the shallow morphological segmentation, i.e. bre...
Journal title:
- Language Resources and Evaluation
Volume 40
Publication date: 2006